Abstract

Interruptions of microsatellite sequences impact genome evolution and can alter disease manifestation. However, human
polymorphism levels at interrupted microsatellites (iMSs) are not known at a genome-wide scale, and the pathways for
gaining interruptions are poorly understood. Using the 1000 Genomes Phase-1 variant call set, we interrogated mono-, di-,
tri-, and tetranucleotide repeats up to 10 units in length. We detected ~26,000-40,000 iMSs within each of four human
population groups (African, European, East Asian, and American). We identified population-specific iMSs within exonic
regions, and discovered that known disease-associated iMSs contain alleles present at differing frequencies among the
populations. By analyzing longer microsatellites in primate genomes, we demonstrate that single interruptions result in a
genome-wide average two- to six-fold reduction in microsatellite mutability, as compared with perfect microsatellites.
Centrally located interruptions lowered mutability dramatically, by two to three orders of magnitude. Using a biochemical
approach, we tested directly whether the mutability of a specific iMS is lower because of decreased DNA polymerase strand
slippage errors. Modeling the adenomatous polyposis coli tumor suppressor gene sequence, we observed that a single base
substitution interruption reduced strand slippage error rates five- to 50-fold, relative to a perfect repeat, during synthesis by
DNA polymerases α, β, or η. Computationally, we demonstrate that iMSs arise primarily by base substitution mutations
within individual human genomes. Our biochemical survey of human DNA polymerase α, β, δ, κ, and η error rates within
certain microsatellites suggests that interruptions are created most frequently by low fidelity polymerases. Our combined
computational and biochemical results demonstrate that iMSs are abundant in human genomes and are sources of
population-specific genetic variation that may affect genome stability. The genome-wide identification of iMSs in human
populations presented here has important implications for current models describing the impact of microsatellite
polymorphisms on gene expression.



PLOS GENETICS



Citation: Ananda G, Hile SE, Breski A, Wang Y, Kelkar Y, et al. (2014) Microsatellite Interruptions Stabilize Primate Genomes and Exist as Population-Specific Single
Nucleotide Polymorphisms within Individual Human Genomes. PLoS Genet 10(7): e1004498. doi:10.1371/journal.pgen.1004498

Editor: Takashi Gojobori, National Institute of Genetics, Japan 

Received November 22, 2013; Accepted May 28, 2014; Published July 17, 2014

Copyright: © 2014 Ananda et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits
unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. 

Funding: This research was supported by a Multiple Principal Investigator award from the National Institute of General Medical Sciences [grant GM087472 to 
KDM and KAE] and by the Penn State Clinical and Translational Science Institute. Additional funding was provided, in part, under a grant with the Pennsylvania 
Department of Health using Tobacco Settlement Funds. The Department specifically disclaims responsibility for any analyses, interpretations or conclusions. The 
funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript. 

Competing interests: The authors have declared that no competing interests exist. 

* Email: kdm16@psu.edu (KDM); kae4@psu.edu (KAE)

¤a Current address: Computational Sciences, The Jackson Laboratory, Bar Harbor, Maine, United States of America

¤b Current address: Department of Biostatistics and Computational Biology, University of Rochester, Rochester, New York, United States of America



Introduction

Over 3% of the human genome consists of microsatellites,
defined as short tandem repeats of 1-6 bases per motif unit,
interspersed throughout the genome [1]. Strand slippage during
DNA synthesis is facilitated by the presence of tandem repeats,
and has been proposed to be the dominant mutational mechanism
for microsatellites [2,3]. Perfect microsatellites contain repeats of a
single motif sequence, whereas interrupted microsatellites (iMSs)
include tandem repeats of a single motif interrupted by other
bases. Many microsatellites are located within coding and
regulatory sequences [4], and can be important modifiers of gene
expression, affecting transcription rate, RNA stability, splicing
efficiency, and RNA-protein interactions [5-7]. Because microsatellite
alleles are highly polymorphic, they may provide a large
pool of heritable, phenotypic variants for subsequent selection [8-10].
Length variation at certain microsatellites contributes to
natural variation in brain development and behavioral traits [11],
and may modulate neurodegenerative disease risk [12].

Microsatellite interruptions also are known to have important 
consequences for human health and disease. For instance, 
germline interruptions of disease-causing microsatellite alleles act 
as a disease modifier for spinocerebellar ataxia type 10 [13], and 
alter the age of onset of spinocerebellar ataxia type 1 [14]. 
Importantly, the presence of interrupted alleles at the FMR1 gene
(Fragile X syndrome) microsatellite diminishes the likelihood of

Author Summary

Microsatellites are short tandem repeat DNA sequences 
located throughout the human genome that display a 
high degree of inter-individual variation. This characteristic 
makes microsatellites an attractive tool for population 
genetics and forensics research. Some microsatellites affect 
gene expression, and mutations within such microsatellites 
can cause disease. Interruption mutations disrupt the 
perfect repeated array and are frequently associated with 
altered disease risk, but they have not been thoroughly 
studied in human genomes. We identified interrupted 
mono-, di-, tri- and tetranucleotide MSs (iMS) within
individual genomes from African, European, Asian and 
American population groups. We show that many iMSs, 
including some within disease-associated genes, are 
unique to a single population group. By measuring the 
conservation of microsatellites between human and 
chimpanzee genomes, we demonstrate that interruptions 
decrease the probability of microsatellite mutations 
throughout the genome. We demonstrate that iMSs arise 
in the human genome by single base changes within the 
DNA, and provide biochemical data suggesting that these 
stabilizing changes may be created by error-prone DNA 
polymerases. Our genome-wide study supports the model 
in which iMSs act to stabilize individual genomes, and 
suggests that population-specific differences in microsat- 
ellite architecture may be an avenue by which genetic 
ancestry impacts individual disease risk. 

repeat expansion to disease-length alleles in the next generation
[15,16]. Similarly, the presence of multiple interruptions at the
DM-1 gene microsatellite decreases the probability of both
germline and somatic expansions [17,18]. Furthermore, a
population-specific, single nucleotide polymorphism within the
APC gene coding region converts an iMS (AAATAAAA) to a
perfect microsatellite (A)8, leading to an increased risk of somatic
APC mutation and colorectal cancer in Ashkenazi Jews [19].
Biomedical interest in microsatellite interruptions has been
renewed recently by the demonstration that iMSs within the
ATXN2 (SCA2) gene are associated with a different disease
presentation than perfect expanded alleles [20]. These studies
demonstrate that a complex relationship exists between microsatellites
and disease that involves not only length but also sequence
polymorphisms. Importantly, iMSs might represent a reservoir of
mutable alleles that can expand in subsequent generations, as was
shown for SCA2 [21] and myotonic dystrophy type 2 [22].

Microsatellite interruptions are major contributors to the
microsatellite life cycle. According to the life cycle hypothesis, a
microsatellite locus undergoes stages of birth, adulthood and death
during its evolution [23]. Microsatellites are "born" from short
tandem repeats (proto-microsatellites) when they reach a threshold
length that alters their mutational behavior [24,25]. Microsatellites
display a characteristically high frequency of motif-based
insertion/deletion (indel) mutations that drive high germline
microsatellite mutation rates; this is in contrast to proto-microsatellites,
which have lower indel mutation frequencies than microsatellites [25,26].
Microsatellites "die" when the length of the tandem repeat falls
below the threshold, and interruptions are the major cause of
microsatellite death [27,28]. Some interruptions can persist for
millions of years (MYs), e.g., for 19-35 MYs at one locus studied
in artiodactyls [29]. These features can serve as an advantage
when using iMSs as markers in population genetics, since
interrupted repeats exhibit lower homoplasy than uninterrupted
MSs. Indeed, for iMSs, the probability of acquiring an interruption
by two independent events (i.e., the probability of
homoplasy) is much lower than the probability of inheriting this
interruption from a common ancestor. Because of this, iMSs might
be more appropriate markers than perfect microsatellites for
studying population differentiation [30]. Interrupted microsatellites
are more stable genetically (less mutable, but still polymorphic)
than perfect repeats in natural chicken populations [31], and
interruptions can reduce the mutability of specific microsatellite
sequences [32-34]. However, the quantitative effects of interruptions
on decreasing human microsatellite mutability have never
been evaluated previously in a genome-wide study.

The significant role of iMSs in modifying the clinical 
manifestations of disease and their important contributions to 
genome evolution warrant a detailed understanding of iMSs. 
Specifically, the architecture of human genomes with regard to 
iMSs has not been previously investigated, and the mechanism by 
which interruptions arise has not been extensively studied. We 
used a multi-disciplinary approach combining computational and 
biochemical methods to address three biologically important 
questions regarding microsatellite interruptions. First, what is the 
quantitative effect of microsatellite interruptions on microsatellite 
mutability genome-wide? Second, how common are microsatellite 
interruptions within the human genome, where do they occur, and 
how often are human populations polymorphic for the presence/ 
absence of interruptions? Third, what are the possible biochemical 
pathways giving rise to microsatellite interruptions? Our results 
reveal the highly dynamic nature of microsatellite mutagenesis in 
the human genome, one that includes a robust level of interruption 
variation, and demonstrate that iMSs provide a source of 
population-specific genetic modifiers potentially affecting the 
stability of individual human genomes. 

Results 

Reduction in microsatellite mutability due to 
interruptions 

To understand the impact of microsatellite interruptions on
human genome stability, we first set out to determine the
genome-wide magnitude of microsatellite mutability reduction due to the
presence of interruptions. For this analysis, we studied high-quality
primate genome alignments using a comparative genomics
approach. Mono-, di-, tri- and tetranucleotide microsatellites
above the threshold repeat number were identified in human,
chimpanzee, orangutan, macaque, and marmoset reference
genomes (Table S1; penta- and hexanucleotide microsatellites
were omitted due to their lower abundance and algorithmic
difficulties in specifying all possible interruptions). iMSs were
identified as microsatellites in which at least one perfect repeat
stretch extended beyond the threshold repeat number. An
interruption was required to be shorter than or equal to the
microsatellite's motif size. For each of the five primate genomes
examined, iMSs were more abundant than perfect microsatellites
(Table S1). When only orthologous iMSs with one or two
interruptions were considered (see Materials and Methods for
details), iMSs numbered from 6,000-38,000, while perfect
microsatellites numbered from 8,000-48,000, depending on the
primate genome analyzed.
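The classification rule described above can be illustrated with a minimal Python sketch. This is not the pipeline used in the study: the function names are hypothetical, the threshold is supplied by the caller (the value 4 used in the usage lines is the trinucleotide threshold quoted in this article from [25]), and the sketch does not check that each interruption is no longer than the motif.

```python
import re


def perfect_tracts(seq: str, motif: str) -> list[int]:
    """Repeat counts of every maximal perfect run of `motif` found in `seq`."""
    pattern = re.compile(f"(?:{re.escape(motif)})+")
    return [len(m.group(0)) // len(motif) for m in pattern.finditer(seq)]


def classify(seq: str, motif: str, threshold: int) -> str:
    """Label a repeat tract as 'perfect', 'interrupted', or 'below threshold'.

    A tract counts as a microsatellite only if at least one perfect run
    reaches `threshold` repeat units (hypothetical, caller-supplied value).
    """
    longest = max(perfect_tracts(seq, motif), default=0)
    if longest < threshold:
        return "below threshold"
    if longest * len(motif) == len(seq):
        return "perfect"
    return "interrupted"
```

For example, `classify("GCGGCGGCGGCGTGCGGCG", "GCG", 4)` returns "interrupted" because one perfect run of four units survives, whereas `classify("GCGGCGGCAGCGGCG", "GCG", 4)` returns "below threshold" because the substitution splits the array into two runs of two units.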

The mutability, or the average squared difference in repeat
number (allele length) between two species [35], was contrasted for
all perfect versus interrupted microsatellites present in
human-chimpanzee genomic alignments. Namely, we performed a
genome-wide comparison of the mutability of microsatellites with
the same repeated motif that were perfect in both human and
chimpanzee to that of microsatellites that were interrupted (with
the same interruption(s)) in both of these species. For microsatellites
of all motif sizes examined, short microsatellites with one
interruption were less mutable than perfect microsatellites with the
same overall repeat number (Figure 1A). The average, genome-wide
mutability difference for mononucleotides was ~two-fold at
12 repeat units, and up to ~six-fold for di-, tri-, and tetranucleotide
microsatellites with 6, 5, and 4 units, respectively.
Microsatellites with two interruptions were, on average, one to
two orders of magnitude more stable than uninterrupted
microsatellites with the same repeat number (Figure 1A). The
mutability difference between perfect and iMS loci was highest at
shorter repeat numbers for all motifs. Thus, the quantitative effect
of a single interruption on an individual microsatellite locus can be
substantial. For example, more centrally located interruptions
have a strong effect on mutability, dramatically lowering
microsatellite mutability up to two to three orders of magnitude,
whereas interruptions located on the microsatellite fringes have
only a marginal effect (Figure 1B). The identity of the interrupting
base has a non-significant effect on mononucleotide microsatellite
mutability (Figure S1).
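The mutability statistic defined above can be written out as a short Python sketch. The function names and the toy repeat numbers are hypothetical and are not data from this study; the repeat-number helper follows the convention stated in the Figure 1 legend (total length, excluding interrupting nucleotides, divided by motif size).

```python
def repeat_number(total_length: int, interrupting_bases: int, motif_size: int) -> float:
    """Repeat number of a microsatellite, excluding interrupting nucleotides."""
    return (total_length - interrupting_bases) / motif_size


def mutability(ortholog_pairs: list[tuple[float, float]]) -> float:
    """Average squared difference in repeat number between two species.

    `ortholog_pairs` holds (human_repeats, chimpanzee_repeats) per locus.
    """
    if not ortholog_pairs:
        raise ValueError("at least one orthologous locus is required")
    return sum((h - c) ** 2 for h, c in ortholog_pairs) / len(ortholog_pairs)


# Invented toy loci, all binned at 6 repeats in human.
perfect_loci = [(6, 6), (6, 8), (6, 7), (6, 5)]
interrupted_loci = [(6, 6), (6, 6), (6, 7), (6, 6)]
fold_reduction = mutability(perfect_loci) / mutability(interrupted_loci)
```

With these invented values the perfect loci have a mutability of 1.5 and the interrupted loci 0.25, a six-fold difference; the numbers serve only to show how a fold reduction is derived from the two averages.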

Microsatellite interruptions in human populations

Armed with the knowledge that interruptions significantly
stabilize microsatellites genome-wide, we next examined individual
human genome microsatellites for the presence of interruption
polymorphisms. We found such polymorphisms to be highly
abundant and informative for predicting population-specific
microsatellite stabilization. In this analysis, we identified
1,814,151 perfect mono-, di-, tri-, and tetranucleotide microsatellites
above the threshold length within the reference human
genome (UCSC build hg19) [25]. Here, we imposed an upper
limit on the microsatellite lengths analyzed (10, 9, 8, and 7 units
for mono-, di-, tri- and tetranucleotide repeats, respectively),
because we found next generation sequencing data at longer
repeats to be biased due to sequencing errors and/or read-length
limitations [25]. For microsatellites that are perfect in the
reference genome, we analyzed the frequency of iMSs within
four human population groups (African, European, East Asian,
and American), using the 1000 Genomes Phase-1 variant call set
[36]. Interruptions were defined as single nucleotide polymorphisms
(SNPs) or indels leading to a sequence within the
microsatellite that differs from the full motif unit. All indel and
SNP variants (with allele frequency ≥0.05) were identified, and
considered to be interruptions if they were located within a
microsatellite but not at the starting/ending repeat unit. In this
manner, we identified ~26,000-40,000 polymorphic iMSs,
depending on the population group (Table 1, Figure 2A; Datasets
S1, S2, S3, S4, S5). A substantial number of interrupted alleles
were present in all four population groups with different allele
frequencies, corresponding to a fixation index (FST) of 0.061
(range: 0.000-0.590; sd: 0.062; median: 0.041), which falls well
within the range of SNP FST values (0.052-0.083) derived from
pair-wise population comparisons of the 1000 Genomes Phase-1
project [36] (Dataset S6). Despite such a low observed average level
of population differentiation, numerous interruptions were shared
by two or three population groups, or unique to a single
population group (referred to as 'population-specific' interruptions
henceforth) (Figure 2A). The greater number of interruptions
within Africans compared to other population groups is likely due
to a higher number of the 1000 Genomes variants in Africans,
reflecting their high diversity [36,37]. We also identified genes that
encode polymorphic exonic iMSs. Among the four population
groups studied, ~3,000-4,000 genes contained polymorphic
interruptions within exonic microsatellites (Table 2). Several genes
encoding exonic iMS alleles are specific to only one population, or
are shared by two or three populations (Figure 2B; Datasets S1, S2,
S3, S4, S5). These data demonstrate that iMSs can provide an
abundant source of population-specific alleles potentially stabilizing
individual genomes by lowering microsatellite mutation rates.
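To make the interruption criteria concrete, the following Python sketch classifies a single SNP falling in a microsatellite that is perfect in the reference genome. It is a simplified illustration under stated assumptions, not the authors' implementation: the function name and coordinate convention are hypothetical, the repeat-number threshold must be supplied by the caller, tracts are counted in whole motif units, and the "death"/"degeneration" labels follow our reading of the Table 1 footnotes (longest remaining perfect tract below, or at/above, the threshold).

```python
def classify_snp(ms_start: int, motif_size: int, n_repeats: int,
                 snp_pos: int, allele_freq: float,
                 threshold: int, min_af: float = 0.05) -> str:
    """Classify a SNP relative to a reference-perfect microsatellite.

    Coordinates are 0-based; `ms_start` is the first base of the tract.
    """
    if allele_freq < min_af:
        return "filtered"            # rarer than the frequency cutoff
    offset = snp_pos - ms_start
    if offset < 0 or offset >= motif_size * n_repeats:
        return "outside"             # not within the microsatellite
    unit = offset // motif_size      # index of the repeat unit that is hit
    if unit == 0 or unit == n_repeats - 1:
        return "edge"                # starting/ending unit: not an interruption
    left, right = unit, n_repeats - unit - 1   # intact units on each side
    return "degeneration" if max(left, right) >= threshold else "death"
```

For a seven-unit trinucleotide repeat starting at position 100 with a threshold of 4, a SNP in the central unit (`snp_pos=109`) leaves three intact units on each side and is labeled "death", whereas the same hit in an eight-unit repeat leaves a four-unit run and is labeled "degeneration".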

Functional consequences of exonic iMS alleles 

We performed more in-depth analyses of the polymorphic
exonic iMSs identified above in the four human population groups
to determine the potential functional impact of iMS presence on
genome function. Only a few of the iMSs identified are predicted
to cause frameshifts or nonsynonymous mutations (Figure S2,
Table S2); the vast majority of population-specific interruptions
are not expected to alter protein sequence. Thus, the primary
effect of iMSs may be to modulate the mutation rate of the
underlying microsatellite. To gain further insight into the potential
biological relevance of the iMSs, we performed Gene Ontology
(GO) analyses for each set of genes encoding population-specific
iMS alleles. The significantly (p<0.01) enriched GO terms are
distinct for each population. For example, the GO terms enriched
in the African-specific iMS genes included several neurological
and organ development terms (Table S3), while those for the
European-specific iMS genes were predominantly immunological
terms (Table S4). Since the GO vocabularies are structured such
that they can be queried at different levels, we examined the
smallest sized GO terms, identified the associated genes containing
the iMS, and queried these genes for clinical associations using
Online Mendelian Inheritance in Man (www.omim.org). Several
genes that we identified in this manner are associated with familial
disease or disease susceptibility (Table 3). For example, we
discovered three African-specific interrupted mononucleotide
microsatellites within the HTT (Huntington's) gene, which
correspond to perfect microsatellites in European, Asian and
American populations (Table 3). It is important to bear in mind
that although the genes identified by this analysis are implicated in
disease, the associated microsatellites have not been shown to play
a causal role. Therefore, these iMSs will have to be studied further
for their potential role in modulating disease risk.

We also examined polymorphisms in 15 genes containing
exonic (coding and UTR) iMS alleles that are well known to be
associated with microsatellite expansion diseases [38]. Eight loci
(ARX, CBFA1, FMR1, FMR2, HOXA13, OPMD, SCA3, and
ZIC2) contained no differences in microsatellite sequence from the
reference genome in any of the four population groups studied.
Four genes (AIB1, SCA2/ATXN2, SCA17, and HOXD13)
contained iMS alleles that differed from the reference genome
sequence, and the variants were present in all four population
groups at differing allele frequencies (Table 4). For some loci/
populations, the reference genome sequence is not the major allele
(e.g., SCA17). The genetic consequences of the iMS variants
include both sequences that are expected to increase mutability,
and sequences expected to decrease mutability. For example, the
HOXD13 variant iMS allele is expected to have lower mutability
than the reference genome iMS due to the presence of a third
interruption that decreases the perfect tandem (GCG)5 repeat to a
length below the mutability threshold (four units for trinucleotide
repeats [25]). The frequency of this triply-interrupted allele varies
from 0.76 in the African population to 0.26 in European and
American populations. The AIB1 locus contains four alternative
iMS alleles present at varying frequencies among the populations,
one of which is a doubly interrupted allele, leading to greater
stabilization of the repeat due to disruption of the (CAG)6 array.
For three loci (DRPLA, SCA1, and FOXL2), we observed
instances of population-specific iMS alleles. DRPLA contained
Figure 1. Effect of interruptions on microsatellite mutability in primate genomes. (A) Mutability of perfect (pure) microsatellites and that
of microsatellites with one or two interruptions. (B) Mutability of perfect (pure) microsatellites and that of microsatellites with single interruptions that
were located within the middle 25%, or in the fringe 25% (at either 5' or 3' end) of the microsatellite length. The number of repeats of a microsatellite
was calculated by dividing the total length of the microsatellite, excepting the interrupting nucleotides, by the size of its repeating motif. At each
repeat number the lines designate the 2.5th and 97.5th percentiles of empirical distributions that were obtained through bootstrap resampling. The
repeats are binned based on their repeat number in the human genome (the reciprocal operation, when binning was based on repeat number in
chimpanzee, did not change the results).
doi:10.1371/journal.pgen.1004498.g001



variant alleles in only two of the four population groups studied 
(African and American), both of which decrease the number of 
interrupting bases, relative to the reference genome, potentially 
increasing mutability of the repeat. Finally, we noted an increased 
number of interruptions within polyglutamine repeats compared 
to polyalanine repeats, consistent with previous observations about 
the high propensity of polyglutamine repeats to acquire length and 
nucleotide polymorphisms [39]. 

Interrupted alleles: Heterozygosity and linkage 
disequilibrium 

Low indel mutation rates of iMSs (Figure 1) also are expected to
be reflected in their low indel polymorphism levels. To test this, we
investigated the levels of heterozygosity and the presence of linkage
disequilibrium (LD) between interrupted microsatellite alleles
caused by indels and neighboring, population-matched SNPs
from the 1000 Genomes Phase-1 data. Approximately 30-40% of
iMSs display low levels of heterozygosity (below 0.2; Figures S3A-D).
In fact, we observed a skew towards lower heterozygosity for
iMSs as compared to that for perfect microsatellites (p = 0.028 for
Asians; p = 0.066, p = 0.057, and p = 0.072 for Africans, Americans,
and Europeans, respectively; Kolmogorov-Smirnov test).
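The heterozygosity comparison above can be sketched in a few lines of Python. This excerpt does not state how heterozygosity was computed, so the sketch assumes the standard expected heterozygosity, H = 1 - Σ p_i², where p_i are the allele frequencies at a locus; the function names and the toy frequencies are hypothetical.

```python
def expected_heterozygosity(allele_freqs: list[float]) -> float:
    """Expected heterozygosity, 1 - sum(p_i^2), for one locus (assumed definition)."""
    if abs(sum(allele_freqs) - 1.0) > 1e-6:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in allele_freqs)


def fraction_low(loci: list[list[float]], cutoff: float = 0.2) -> float:
    """Fraction of loci whose expected heterozygosity falls below `cutoff`."""
    values = [expected_heterozygosity(freqs) for freqs in loci]
    return sum(1 for h in values if h < cutoff) / len(values)


# Invented allele frequencies for three biallelic loci.
toy_loci = [[0.5, 0.5], [0.9, 0.1], [0.95, 0.05]]
```

Under this definition a biallelic locus with a minor allele frequency of 0.1 has a heterozygosity of about 0.18 and would fall in the "low" class (below 0.2) used in the text.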

In each of the four populations studied, 4,400 to 5,000
interruption-causing alleles (36-49% of the alleles investigated)
were found to be in moderate LD (R² > 0.80) with SNPs (Figure
S4, Table S5), and 686 to 990 alleles (6-10%) were in perfect LD
(R² = 1) with SNPs. Interestingly, certain interruption alleles
displayed perfect LD in some, but not all, populations (Table
S5). Generally, iMS alleles in the African population displayed
lower levels of LD compared to the other three populations (Figure
S3), likely due to the abundance of low-frequency variants in
Africans compared to non-African populations [36].
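For readers unfamiliar with the LD statistic, the following Python sketch computes R² between an interruption allele and a neighboring SNP from phased haplotypes, using the standard definition R² = D² / [p_A(1 - p_A) p_B(1 - p_B)] with D = p_AB - p_A p_B. The function name and the toy haplotypes are hypothetical; how LD was actually calculated for this study is not described in this section.

```python
def r_squared(haplotypes: list[tuple[int, int]]) -> float:
    """R² between two biallelic loci.

    Each haplotype is (a, b), where a and b are 0/1 indicators for the
    interruption allele and the SNP allele, respectively.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_ab - p_a * p_b
    return d * d / denom


perfect_ld = [(1, 1), (1, 1), (0, 0), (0, 0)]   # alleles always co-occur
no_ld = [(1, 1), (1, 0), (0, 1), (0, 0)]        # alleles are independent
```

In the first toy sample the two alleles always travel together, giving R² = 1 (perfect LD); in the second they are independent, giving R² = 0.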

The exonic iMSs in perfect LD with neighboring SNPs were
examined in more detail. Within each population, 6 to 11 of such
alleles were identified (Table S6). For each allele, we examined the
phenotype and disease relationships of the linked SNPs using the
SNPnexus web browser [40-42], and found associations with
cancer, neurological, immune, cardiovascular, and metabolic
disorders (Table S6). These associations reiterate a potential for
iMSs to modulate disease risk in a population-specific manner.

A case example: Mutability of an exonic iMS associated
with colorectal cancer

We sought to directly verify the quantitative effect on mutability
of a single base substitution interruption within an exonic
microsatellite encoded within a human disease gene. We chose
the well-established biological model of a population-specific iMS
encoded within the APC tumor suppressor gene. In 6% of the
Ashkenazi Jewish population, a centrally located iMS
(AAATAAAA) within an exon of the APC gene is present in the
germline as a perfect A8 microsatellite (AAAAAAAA); this
nonsynonymous SNP leads to an I1307K variant, but has no
effect on APC protein function [19]. Nevertheless, this population
has a greater chance of producing an inactive APC gene in
somatic tissues, which increases the risk of colorectal cancer [43].
The proposed mechanism accounting for this observation is the
enhanced somatic mutability of the perfect A8 sequence, relative to
the interrupted sequence [19,44]. We modeled the germline
sequences of the perfect and interrupted APC microsatellites, and
measured DNA polymerase strand slippage error rates using our
established in vitro assay. Briefly, in this analysis, defined tandem
repeat sequences are inserted in-frame within a reporter gene.
Vectors containing these reporter cassettes are used as templates
for in vitro DNA synthesis reactions, and DNA polymerase errors
that result in gene inactivation (frameshift, nonsense or missense
mutations) are scored by genetic selection in E. coli [45,46]. To
determine the specificity of polymerase errors, independent
mutants are isolated, and the DNA sequence changes within the
reporter region are determined by dideoxy DNA sequence analysis
of purified vector DNA [47].

For these experiments, we examined three DNA polymerases, 
representing distinct polymerase families and postulated to be 



Table 1. The number, types, and consequences of polymorphic iMS loci genome-wide in the 1000 Genomes Phase-1 dataset, by
population group.

Number of tandem repeats

Population | Total loci^a | SNP interruption: Death^b | SNP interruption: Degeneration^c | Indel interruption: Death | Indel interruption: Degeneration
African | 39,105 | 25,746 | 3,971 | 7,304 | 2,906
European | 27,416 | 17,707 | 2,786 | 5,327 | 2,116
Asian | 25,939 | 16,617 | 2,582 | 5,227 | 2,034
American | 28,522 | 18,386 | 2,912 | 5,584 | 2,203

^a The number of loci is less than the total number of SNP and indel interruptions because some loci contain multiple interruptions.
^b Interruptions resulting in iMS below the threshold length.
^c Interruptions resulting in iMS at/above the threshold length.
doi:10.1371/journal.pgen.1004498.t001
Figure 2. Distribution of interrupted microsatellites in human 1000 Genomes populations. Venn diagrams depicting (A) numbers of
interrupted microsatellites (iMSs) across the four populations genome-wide, and (B) numbers of genes with iMSs in exons. Tan, blue, green, and red
ellipses represent African, European, Asian, and American populations, respectively. Numbers in blue, red, maroon, and black represent counts of
population-specific iMSs (absent in the other three), iMSs shared between two populations (and absent in the other two), iMSs shared between three
populations (and absent in the fourth), and iMSs common to all populations, respectively.
doi:10.1371/journal.pgen.1004498.g002



required for distinct genome maintenance functions: Pol α, DNA
replication; Pol β, DNA repair; and Pol η, translesion synthesis.
The accuracy of each polymerase was measured on four DNA
templates, representing the complementary strands of the perfect
(A8 and T8) and iMS (A3TA4 and T3AT4) alleles in APC
(Figure 3A). For the perfect allele templates, the polymerases
created +1 A/T insertions, -1 A/T deletions, and A:T to T:A
transversions that lead to TAA nonsense codons (data not shown),
which also are the types of inactivating APC somatic mutations
observed within tumors from I1307K carriers [44]. For the iMS
allele templates, the polymerase indel error frequency was five- to
50-fold lower than that for the perfect allele, depending on the
polymerase, demonstrating strand slippage stabilization by this
single interruption (Figure 3A; Table S7). We observed that the
interrupting base is rarely removed by these polymerases; the
predominant errors (>95%) are indels within the remaining
perfect tandem repeat tracts (Figure 3B). The frequency of deleting
the interrupting base to create a perfect allele was very low
Table 2. The number and types of exonic iMS in the 1000 Genomes Phase-1 dataset, by population.

Population | Total number of affected genes^a | Mechanism^b: SNP | Mechanism^b: Indel
African | 4,224 | 3,478 | 1,213
European | 3,029 | 2,323 | 879
Asian | 2,839 | 2,158 | 819
American | 3,174 | 2,427 | 908

^a The number of genes is less than the total number of iMS because some genes contain multiple interruptions.
^b Number of iMSs.
doi:10.1371/journal.pgen.1004498.t002






Table 3. Examples of disease-associated genes corresponding to Gene Ontology terms significantly enriched (p<0.01) in
population-specific iMS alleles.

Gene (Chr.) | GO terms (GOID) | Disease Association^a | iMS Location^b | MS Motif^c

AFR-specific interruptions (perfect in EUR, AMR, ASN)
APOB (Chr. 2) | Positive regulation of cholesterol storage (GO:0010886); Lipoprotein catabolic process (GO:0042159) | Familial hypobetalipoproteinemia | 21254729 | ACC
ATM (Chr. 11) | Lipoprotein catabolic process (GO:0042159); Histone mRNA catabolic process (GO:0071044) | Ataxia telangiectasia; breast cancer susceptibility | 108170038 | TC
ENPP1 (Chr. 6) | Inorganic diphosphate transport (GO:0030505) | Susceptibility to Type II diabetes, obesity | 132169558 | C
 | | | 132194964 | AT
HTT (Chr. 4) | Neural plate development (GO:0001840); Neural plate formation (GO:0021990); Citrulline metabolic process (GO:0000052) | Huntington's disease | 3129695 | A
 | | | 3150527 | A
 | | | 3177754 | T
MSR1 (Chr. 8) | Positive regulation of cholesterol storage (GO:0010886) | Hereditary prostate cancer | 16021086 | TA
MYH9 (Chr. 22) | Establishment of T cell polarity (GO:0001768) | May-Hegglin/Fechtner syndrome | 36676990 | T
 | | | 36683802 | A
POLG (Chr. 15)/ | Gamma DNA polymerase complex (GO:0005760) | Progressive external ophthalmoplegia; Mitochondrial depletion syndrome | 89867478 | ACG
DNA2 (Chr. 10) | | | 70181580 | T
SCN2A (Chr. 2) | Sodium channel complex (GO:0034706) | Autism spectrum disorder; infantile epilepsy | 166242996 | A
SCN5A (Chr. 3) | Sodium channel complex (GO:0034706) | Long QT syndrome, Brugada syndrome | 38691860 | AC
SCNN1B (Chr. 16) | Sodium channel complex (GO:0034706) | Liddle syndrome | 23312745 | GA
SPTB (Chr. 14) | Long term strengthening of neuromuscular junction (GO:0042062) | Hereditary spherocytosis; elliptocytosis | 65232615 | TG

EUR-specific interruptions (perfect in AFR, AMR, ASN)
SDHA (Chr. 5) | Mitochondrial respiratory chain complex II; Succinate dehydrogenase complex (GO:0005749; 0045281) | Leigh syndrome | 229004 | A
SDHAF2 (Chr. 11) | Mitochondrial electron transport; protein-FAD linkage (GO:0006121; 0018293) | Paraganglioma 2 | 61205342 | T
PCCA (Chr. 13) | Propionyl-CoA carboxylase activity (GO:0004658) | Propionic acidemia | 101019895 | T

ASN-specific interruptions (perfect in AFR, AMR, EUR)
POMC (Chr. 2) | Types 1, 3 and 4 melanocortin receptor binding (GO:0070996; 0031781; 0031782) | Proopiomelanocortin deficiency | 25384471 | GCT

AMR-specific interruptions (perfect in AFR, ASN, EUR)
SAG (Chr. 2) | Rhodopsin mediated signaling pathway; rhodopsin mediated phototransduction; opsin binding (GO:0016056; 0009586; 0002046) | Oguchi disease | 234248618 | T

^a Genes with non-microsatellite variants known or suspected of being associated with disease/disease susceptibility [Online Mendelian Inheritance in Man (www.omim.org)].
^b Exact chromosomal position (in base pairs) of the interruption(s) in the hg19 reference genome and the indicated chromosome.
^c Sequence motif of the perfect microsatellite repeat(s).
doi:10.1371/journal.pgen.1004498.t003
Table 4. Population allele frequencies for iMS in expanded microsatellite disease loci. 

| Locus^a | hg19 sequence^b / Variant allele sequence(s) | Allele frequency (AFR) | Allele frequency (EUR) | Allele frequency (ASN) | Allele frequency (AMR) |
|---|---|---|---|---|---|
| Polyglutamine repeats | | | | | |
| AIB1 | (CAG)6CAA(CAG)9 | 0.1464 | 0.3426 | 0.5699 | 0.4365 |
| | a. (CAOgCAAlCAOs | 0.3455 | 0.3955 | 0.2115 | 0.3674 |
| | b. (CAG)3CAA(CAG)2CAA(CAG)9 | 0.2012 | 0.1839 | 0.1329 | 0.1243 |
| | c. (CAG)6CAA(CAG)10 | 0.3069 | 0.0780 | 0.0857 | 0.0718 |
| DRPLA | CAGCAACAGCAA(CAG)15 | 0.7032 | 1.0 | 1.0 | 0.8232 |
| | a. CAGCAA(CAG)15 | 0.1748 | 0 | 0 | 0.0939 |
| | b. CAGCAA(CAG)16 | 0.1220 | 0 | 0 | 0.0829 |
| SCA1/ATXN1 | (GCA)12CAT(CAG)CAT(CAG)14 | 0.7094 | 0.8650 | 0.3794 | 0.7072 |
| | a. (GCA)12CAT(CAG)CAT(CAG)CAT(CAG)12 | 0.0772 | 0.0622 | 0.1556 | 0.1298 |
| | b. (GCA)12CAT(CAG)16 | 0.2134 | 0.0728 | 0.2622 | 0.1630 |
| | c. (GCA)12CAT(CAG)CAT(CAG)3CAT(CAG)10 | 0 | 0 | 0.2028 | 0 |
| SCA2/ATXN2 | (CAG)13CAA(CAG)9 | 0.5224 | 0.3135 | 0.5385 | 0.3785 |
| | (CAG)8CAA(CAG)4CAA(CAG)9 | 0.4776 | 0.6865 | 0.4615 | 0.6215 |
| SCA17/TBP | (CAG)3(CAA)3(CAG)8CAA(CAG)CAA(CAG)15 | 0.1704 | 0.1667 | 0.0892 | 0.1105 |
| | (CAG)3(CAA)3(CAG)9CAA(CAG)CAA(CAG)19 | 0.8296 | 0.8333 | 0.9108 | 0.8895 |
| Polyalanine repeats | | | | | |
| HOXD13 | (GCG)4GCA(GCG)2GCT(GCG)5 | 0.2398 | 0.7407 | 0.6801 | 0.7431 |
| | (GCG)4GCA(GCG)2GCT(GCG)3GCAGCG^c | 0.7602 | 0.2593 | 0.3199 | 0.2569 |
| FOXL2^d | (CGG)(CCC)(CGG)(CGC)C(CGC)(CA)(CGC)2(ACC)(CGC)(CTG)(CGG)(CGC)(CTC)(CGG) | 0.5630 | 1.0 | 0.9301 | 1.0 |
| | (CGG)(CCC)(CGG)(CGC)CCCC(CA)(CGC)2(ACC)(CGC)(CTG)(CGG)(CGC)(CTC)(CGG) | 0.4370 | 0 | 0.0699 | 0 |

^a Chromosomal locations for iMS examined are as follows: AIB1: Chr. 20, 46279815-899; DRPLA: Chr. 12, 7045879-936; SCA1: Chr. 6, 16327866-954; SCA2: Chr. 12, 112036754-823; SCA17: Chr. 6, 170870995-1103; HOXD13: Chr. 2, 176957786-825; FOXL2: Chr. 3, 138664861-903. The following loci were examined, but no differences from the reference genome were observed in any of the four populations: SCA3; CBFA1; ZIC2; OPMD; HOXA13; ARX; FMR1; FMR2. 

^b Underline indicates the position of the interruption within the hg19 reference sequence. Bold font indicates the variant sequence (relative to the reference sequence) identified within the four populations examined. 

^c This allele corresponds to the non-diseased sequence reported in reference [60]. 

^d There is another microsatellite present at this locus [(CGC)(GGC)(TGC)(AGC)(CGC)(AGC)(TGC)2(AGC)(CGC)(TGC)(GGC)(TGC)(CGC)]; however, it showed no differences from the reference genome in any of the four populations. 

doi:10.1371/journal.pgen.1004498.t004 



(9.2x10 and 2.2x10 for Pol α and Pol β, respectively), 
relative to other types of polymerase errors (Table S7). Moreover, 
the polymerase error frequencies at the residual repeats within the 
iMS alleles were similar to the error frequencies at similar short 
tandem repeats located elsewhere within the HSV-tk gene coding 
sequence (data not shown). These analyses strongly suggest that 
the single nucleotide interruption within the APC gene leads to the 
mutational death of the microsatellite. 

DNA sequence analyses of Pol η errors produced on the 
interrupted templates emphasized the novel mutational signature 
of this enzyme within this specific microsatellite motif (Figure 3C). 
Intriguingly, Pol η has the unique ability to litter this iMS with 
additional errors, often creating a DNA synthesis product that is 
more random in sequence than the starting iMS template 
sequence. Despite this ability, the original interrupting base is 
maintained in the majority (79%) of Pol η synthesis products. 

Pathways of gaining interruptions 

Despite the clear biological significance of iMSs for human 
genome stability and disease risk, very little is known regarding the 
biochemical pathways by which interruptions arise in 
microsatellites. Mutational events that create interrupted alleles 
could be produced by several cellular mutagenesis pathways, including 
cytosine deamination events, the creation of abasic sites, 
endogenous DNA damage-induced mutations, and DNA polymerase 
errors, among others. We used two complementary approaches to 
gain insight into the potential pathways underlying the production 
of iMSs in the human genome. First, the abundance of 
polymorphic interruptions and the short evolutionary time since 
divergence of the four 1000 Genomes population groups allowed 
us to examine in detail the types of mutations leading to 
population-specific microsatellite interruptions. (We observed a high 
degree of interruption gain/loss event saturation along primate 
phylogenetic branches, precluding us from deciphering interruption 
pathways in this data set. For instance, the resulting numbers of 
interruptions along the human or chimpanzee lineages since their 
~6 MY split were similar to that along the orangutan lineage since 
its ~12 MY split from the human lineage (Figure S5).) Second, the 
fact that DNA polymerases can create interruption errors during 
in vitro synthesis of microsatellite-containing templates [45,46,48] 
afforded us the opportunity to examine one biochemical pathway, 
namely, polymerase errors during DNA synthesis. 

Population genomics approach. Interruption variants 
identified in the 1000 genomes datasets were classified as either 
[Figure 3 graphic. Axis labels: iMS and pure alleles for Pol α, Pol β, and Pol η. Panel C lists new alleles (N) grouped as base substitutions, indels, and complex (base substitution and indel) events.] 

Figure 3. DNA polymerase error rates at interrupted microsatellites corresponding to sequences within the APC gene. (A) DNA 
polymerase indel error frequency. The Pol EF for each of the four alleles was determined separately from two independent polymerase reactions per 
single-stranded template (Table S5). Indel Pol EFs were calculated by multiplying the proportion of unit-based indel mutational events (as examples, 
[A]8→[A]7 for a perfect allele or A3TA4→A3TA3 for an iMS allele) by the microsatellite Pol EF. Numbers on the top of each column were obtained by 
adding the indel Pol EFs of the complementary alleles in order to compare the difference in polymerase fidelity upon introduction of a single 
nucleotide polymorphism (SNP) that converts the double-stranded iMS sequence to a double-stranded perfect (pure) sequence. (B) Specificity of Pol 
α and Pol β mutational events within the iMS alleles. Proportions of mutational events found within the three-unit tandem repeat (open sectors), the 
interrupting base (black sectors), and the four-unit tandem repeat (gray sectors). Total mutational events for Pols α and β were 74 and 35, respectively, 
and all were indel events. Two Pol α events generated the loss of the interrupting T within the A3TA4 iMS sequence (A3TA4→[A]7 and A3TA4→[A]6). 
One similar event occurred for Pol β at the T3AT4 iMS sequence (T3AT4→[T]4). (C) Pol η mutational events within the iMS alleles generate sequence 
diversity. Events (71 total) are categorized according to the mutational mechanism that most likely created them. Red indicates individual mutational 
events. Underline indicates a missing base or bases. Number in parentheses shows the number of mutants carrying the new sequence. 
doi:10.1371/journal.pgen.1004498.g003 

Figure 4. Pathways (substitutions, insertions, and deletions) driving the African population-specific interruptions. Repeats separated 
by (A) motif size, (B) repeat number, and (C) motif sequence for mono- and dinucleotide microsatellites. 
doi:10.1371/journal.pgen.1004498.g004 

Figure 5. DNA polymerase interruption mutagenesis within [GT]n and [TC]n dinucleotide microsatellite sequences. (A) Interruption Pol 
EFs at the [GT]10, [GT]19, and [TC]11 alleles for B-family (Pols α and δ), X-family (Pol β) and Y-family (Pols κ and η) DNA polymerases. Interruption Pol 
EFs were calculated from unpublished and published [24,26,45,46,48,94] data by multiplying the proportion of interruption mutational events at each 
allele by the microsatellite Pol EF. Only detectable interruptions (i.e., interruptions that produce a frameshift or a stop codon) were included in this 
analysis given that an event must be detectable to contribute toward the Pol EF. Less than symbol (<) indicates that no interruption events were 
found for Pol α at the [GT]10 allele; the interruption Pol EF is estimated to be <5.7x10"^. The Pol EF was not determined for Pol α or Pol η using the 
[GT]19 template. (B) DNA polymerases utilize signature interruption mechanisms. Pie charts depict the proportion of mutational events generated by 
each possible interruption mechanism at [GT]n and [TC]n alleles. Graphs include both detectable and undetectable interruptions. Data used in the 
[GT]n chart is a compilation of interruption events from Pol β (N = 32) at [GT]10, [GT]13, and [GT]19, Pol κ (N = 36) at [GT]10, [GT]13, and [GT]19, and Pol η 
(N = 29) at [GT]10. The [TC]n chart includes events from Pol β (N = 11) at [TC]11 and [TC]14, Pol κ (N = 21) at [TC]11 and [TC]14, and Pol η (N = 58) at 
[TC]11. See Supplementary Figures S7 and S8 for complete representation of interruption mutations. (C) Detailed specificity of interruption events at 
[GT]n and [TC]n microsatellites. Columns in blue indicate the proportion of total interruptions that are single base deletions. Columns in red indicate 
the proportion that are single base insertions and columns in black/gray indicate the proportion that are base substitutions. Data used for this 
analysis is the same as that used in (B) for Pols β, κ, and η. Data in the combined column indicate the specificity obtained upon combining data from all 
three polymerases. 

doi:10.1371/journal.pgen.1004498.g005 

Table 5. Sequence diversity created in vitro by human DNA polymerase η base substitution errors within perfect microsatellites. 



Starting Microsatellite Template Sequence 



[A]8 [T]8 [GT]10 [TC]11 



iMS Mutational Events 

A T [Ale' A m? GC [GTls GA TC CC [ICU TT [YCU 

[AU T [A]3 [T]2 C [Tls GT CT [GTlj GG [GT14'' TC CC [TClg CT UCh 

[Als G a" Uls C [712 [GTI2 GC [GTla GCGC" TC TT \JCU GC [YCU 

me C T [GT]3 GG CT [GT14'' UCh CC TC TT UQ,' 

me C m2 [GTle GG TT GT" \JCh TT [TC], TT (TCls' 

[TC]3 CT [TCls' 
[TC]3 GC [TC]2TT [TCls' 
[TC]3 TT [TCls TT UQi' 
[TCls TT [TCIs TT' 



Bold, interrupting base(s). 

*^Three independent occurrences. 

"^A substitution occurred with a 1 unit deletion. 

'^A substitution occurred with a 1 unit insertion. 

^A substitution occurred with a 2 unit deletion. 

doi:10.1371/journal.pgen.1004498.t005 



base substitutions (SNP variants) or indels (insertion/deletion variants 
that did not include a whole-motif insertion/deletion). The intrinsic 
properties of microsatellites (motif size, repeat number and motif 
composition) are known to be the primary factors dictating 
motif-based indel mutations within microsatellites [49,50]. Therefore, we 
examined the effect of intrinsic sequence properties on the production 
of population-specific iMS alleles (Figure 4 presents results for the 
African population; the results for the other three populations are very 
similar; Figures S6, S7, S8). Base substitutions are the primary 
mutation type leading to iMSs in all four population groups, all motif 
classes and repeat numbers examined (Figures 4, S6, S7, S8). The 
relative proportion of substitution-based interruptions is lower in 
tetranucleotides compared to the other three motif classes (Figures 4A, 
S6). This may reflect the fact that numerous tetranucleotide motifs 
contain proto-microsatellites of two or three tandem repeats (e.g., 
TTCC or TTTC), which would be expected to increase the likelihood 
of indel interruption mutations. For all motif sizes, with increasing 
repeat number, the proportion of substitution-driven interruptions 
decreases, while the proportion of indel-based interruptions increases 
(Figures 4B, S7). 
For mono- and dinucleotide microsatellites, we observed some 
differences in the proportion of iMS alleles based on motif composition 
(Figures 4C, S8). In particular, [G/C]n alleles were found to have 
more insertion interruptions compared to [A/T]n alleles. Interestingly, 
these trends reflect the order of slippage-driven polymorphism 
incidence ([G/C]>[A/T] for mononucleotides), as observed in the 
1000 Genomes Pilot-1 dataset for length polymorphisms [25]. 
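
The classification rule described above (base substitutions versus indels that are not whole-motif insertions/deletions) can be illustrated with a minimal sketch. This is not the pipeline used in the study: the function names `is_whole_motif` and `classify_variant` are hypothetical, and the code assumes VCF-style left-anchored alleles in which the shorter allele is a prefix of the longer one.

```python
def is_whole_motif(seq: str, motif: str) -> bool:
    """True if seq is one or more complete copies of the motif (any rotation)."""
    k = len(motif)
    if not seq or len(seq) % k != 0:
        return False
    rotations = {motif[i:] + motif[:i] for i in range(k)}
    unit = seq[:k]
    return unit in rotations and seq == unit * (len(seq) // k)


def classify_variant(ref: str, alt: str, motif: str) -> str:
    """Classify a variant that falls inside a microsatellite.

    Assumption: alleles are left-anchored, so for a simple insertion or
    deletion the shorter allele is a prefix of the longer one.
    """
    if len(ref) == len(alt):
        return "substitution"          # SNP-type interruption
    shorter, longer = sorted((ref, alt), key=len)
    if not longer.startswith(shorter):
        return "complex"               # not a simple indel under this assumption
    changed = longer[len(shorter):]    # inserted or deleted bases
    if is_whole_motif(changed, motif):
        return "whole-motif indel"     # length change, not an interruption
    return "indel"                     # partial-motif indel interruption


if __name__ == "__main__":
    print(classify_variant("A", "T", "A"))      # substitution
    print(classify_variant("G", "GT", "GT"))    # indel
    print(classify_variant("G", "GTG", "GT"))   # whole-motif indel
```

Checking the inserted or deleted bases against rotations of the motif, rather than checking only the length, matters for mononucleotide repeats, where every indel length is a multiple of the motif size.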

Biochemical approach. DNA synthesis errors by 
polymerases can arise during the biochemical pathways of DNA 
replication, recombination, repair, and translesion synthesis. The 
human DNA polymerases associated with these four pathways 
constitute distinct enzymatic families and have differing inherent 
accuracies [51]. To gain insight as to which DNA polymerases 
potentially may produce iMSs in the genome, we surveyed the in 
vitro frequency of detectable iMS errors (see below) that are 
created by polymerases involved in replication (Pols α and δ), 
recombination (Pols δ and η), repair (Pols β, δ, and κ), and 
translesion synthesis (Pols κ and η). Detectable interruption errors 
within dinucleotide microsatellites can be produced in our 
in-frame genetic reporter assay by single base indel errors or by base 
substitution errors that create a nonsense codon and inactivate the 
HSV-tk protein. For the dinucleotide motifs examined, we 
observed that genome-stabilizing iMSs are created most frequently 
by error-prone polymerases. The replicative human DNA 
polymerases α (Pol α) and δ (Pol δ) create interruption errors 
within [GT]10 and [TC]11 alleles at a very low frequency (~ 10~^; 
Figure 5A). These results are similar to our previous report for 
yeast replicative Pol δ and Pol ε holoenzymes [52]. Conversely, the 
specialized translesion synthesis polymerases, Pol κ and Pol η, 
produce a relatively high frequency (~ 1 0 ' ) of interruption errors 
within the same alleles (Figure 5A). The repair polymerase, Pol β, 
has an intermediate interruption error frequency that ranges from 
10 to 10 depending on the allele sequence. For the 
polymerases examined, the interruption error frequency increases 
with allele length, up to 10"^ within the [GT]19 allele. Thus, iMS 
alleles within the specific microsatellites examined are readily 
created by human DNA polymerases. 

We undertook an in-depth analysis of the interruption errors 
produced by polymerases within these templates to further 
understand the potential biochemical pathways by which iMSs 
may arise in the human genome through DNA polymerase errors. 
Unique iMS mutational signatures are created by each DNA 
polymerase within the [GT]n and [TC]n alleles (Figure 5B; Table 
S8; Figures S9 and S10). Pol η produced a characteristically high 
proportion of base substitution errors on both templates, while Pol 
κ displayed a propensity for creating single base insertion errors 
within the dinucleotide alleles. The very few interruption errors 
produced by Pol δ were primarily (~70-75%) base substitutions. 
We note that the interruption error specificity of Pols β and κ was 
somewhat influenced by the motif sequence. Pol η produced a 
unique error profile when synthesizing perfect mono- and 
dinucleotide templates, in that the synthesis products are 
characterized by a high degree of sequence diversity (Table 5). 
A full 20% of Pol η iMS DNA products contained two or more 
interruptions (14/71); in comparison, the related Y-family Pol κ 
produced two or more interruptions in only 3% (2/55) of cases. 

We pooled all interruption errors created by the three 
polymerases most frequently producing iMS errors (Pols β, κ, η) 
using five dinucleotide templates, and calculated the proportion of 
interruption errors via base substitution, single base deletion, and 
insertion events (Figure 5C). For the [GT]10-19 dinucleotide motif, 
the majority of iMSs arose by base substitution errors (60%), 
followed by single base deletion errors (29%), and then by single 
base insertion errors (11%). A similar trend was observed for errors 
produced within the [TC]11-14 templates (71% base substitutions). 
These polymerase data for the types of iMS errors produced 
within dinucleotide alleles in vitro are in concordance with the 
human genome data for the types of interrupted dinucleotide 
microsatellites observed genome-wide (Figure 4C). 
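
The arithmetic behind these comparisons is simple enough to sketch. The Figure 5 legend states that an interruption Pol EF is the proportion of interruption mutational events at an allele multiplied by the microsatellite Pol EF; the pooled spectrum is the share of each interruption type among all interruptions. The snippet below is only an illustration under stated assumptions: the function names, the event labels, and the counts in the usage lines are hypothetical and are not data from the study, and the sketch does not separate detectable from undetectable interruptions.

```python
INTERRUPTION_KINDS = (
    "substitution",
    "single_base_deletion",
    "single_base_insertion",
)


def interruption_pol_ef(event_counts, microsatellite_pol_ef):
    """Proportion of interruption events times the microsatellite Pol EF."""
    total = sum(event_counts.values())
    if total == 0:
        raise ValueError("no mutational events recorded")
    interruptions = sum(event_counts.get(kind, 0) for kind in INTERRUPTION_KINDS)
    return interruptions / total * microsatellite_pol_ef


def interruption_spectrum(event_counts):
    """Share of each interruption type among all interruption events."""
    counts = {kind: event_counts.get(kind, 0) for kind in INTERRUPTION_KINDS}
    n = sum(counts.values())
    if n == 0:
        raise ValueError("no interruption events recorded")
    return {kind: count / n for kind, count in counts.items()}


if __name__ == "__main__":
    # Hypothetical counts for one polymerase on one template.
    counts = {
        "unit_indel": 90,
        "substitution": 6,
        "single_base_deletion": 3,
        "single_base_insertion": 1,
    }
    print(interruption_pol_ef(counts, microsatellite_pol_ef=1e-2))
    print(interruption_spectrum(counts))
```

Pooling polymerases, as done for Figure 5C, would amount to summing the per-polymerase count dictionaries before calling `interruption_spectrum`.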

Discussion 

In this study, we answered three biologically significant 
questions regarding mono-, di-, tri- and tetranucleotide 
microsatellite interruptions in the human genome. First, using primate 
genome alignments, we quantified the genome-wide effect of 
interruptions on decreasing microsatellite mutability, and found it 
can be significant and strong, from several fold to several orders 
of magnitude, compared with perfect repeats. Second, utilizing the 
1000 Genomes Phase-1 dataset, we found iMS polymorphisms to 
be highly abundant and informative for predicting 
population-specific microsatellite stabilization, especially for exonic loci. The 
vast majority of the population-specific, exonic iMSs we identified 
are not expected to alter protein sequence; thus, the primary effect 
of interruptions may be to modulate the mutation rate of the 
underlying microsatellite. Third, we discovered that base 
substitutions are the primary type of interruption among MSs in all 
population groups, and for the four microsatellite classes 
examined. We surveyed five mammalian DNA polymerases 
involved in DNA replication, repair, and specialized functions, 
and found that, for the mono- and dinucleotide microsatellite 
sequences analyzed, iMSs are created most frequently by 
error-prone polymerases. Pol η is notable among the enzymes examined 
in that the microsatellite DNA synthesis products are 
characterized by a high degree of sequence diversity. 

Contribution of interruptions to human genome 
stabilization 

Early studies of microsatellite interruptions demonstrated 
reduced mutation rates at a few iMS loci, as compared with 
perfect alleles of the same repeat number [28-30,44]. A higher 
mutability of microsatellites was observed for interruptions closest 
to the repeat tract ends, as compared with centrally located 
interruptions [31,33,53,54]. Such studies suggested that 
interruptions might effectively divide microsatellites into shorter repeat 
runs. Within the interrupted repeat itself, the mutation rates of the 
individual arms depend on the lengths of perfect tracts remaining 
within the iMS allele [55]. 

Here, we provide a detailed, genome-wide analysis of the 
mutability of perfect and interrupted MSs in completely sequenced 
primate genomes. For the four motif sizes examined, interruptions 
significantly reduced mutability when present (a) within shorter 
microsatellites, (b) in multiple numbers (i.e., two interruptions per 
microsatellite), or (c) near the center of the microsatellite (Figure 1), 
all of which give rise to a shorter perfect repeat tract. 
Importantly, the magnitude of the effect of interruptions on 
microsatellite allele length variation ranged from a few-fold to 
several orders of magnitude for loci across the genome. 
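
The idea that interruptions act by shortening the perfect repeat tract can be made concrete with a small sketch that lists the lengths of the perfect runs remaining in an interrupted allele. The function name `perfect_tracts` and the unit-list representation are assumptions made for illustration; the two example alleles are the SCA2/ATXN2 sequences listed in Table 4.

```python
def perfect_tracts(units, motif):
    """Lengths (in repeat units) of the uninterrupted runs of `motif`."""
    runs, current = [], 0
    for unit in units:
        if unit == motif:
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


# (CAG)13 CAA (CAG)9: one interruption
sca2_ref = ["CAG"] * 13 + ["CAA"] + ["CAG"] * 9
# (CAG)8 CAA (CAG)4 CAA (CAG)9: two interruptions
sca2_var = ["CAG"] * 8 + ["CAA"] + ["CAG"] * 4 + ["CAA"] + ["CAG"] * 9

print(perfect_tracts(sca2_ref, "CAG"))  # [13, 9]
print(perfect_tracts(sca2_var, "CAG"))  # [8, 4, 9]
```

Under this view, the second interruption lowers the longest perfect tract from 13 to 9 units, which is the quantity the observations above tie to reduced mutability.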

We also report that the perfect microsatellites in the human 
reference genome analyzed here (≤10 units in length) are 
frequently found as iMS polymorphisms within the genomes of 
individuals from four population groups. Although the majority of 
iMS alleles were shared among all groups, many of the iMS alleles 
we detected were specific to only one population group, or shared 
between subsets of population groups (Figure 2). Our quantitative 
results for the stabilizing effects of interruptions in short 
microsatellites are biologically relevant here, as the vast majority 
of iMSs we identified in human genomes are within short 
microsatellites, just above the length threshold. Therefore, 
interruptions are expected to have a strong effect on stabilizing 
such microsatellites. Thus, iMSs are a likely source of 
population-specific genetic variants that can affect the stability of individual 
genomes by reducing the mutability of microsatellites. To the best 
of our knowledge, this is the first report of iMSs as an abundant 
source of population-specific genetic modifiers in the human 
genome. A full account of the abundance of iMSs within the human 
genome must await future studies, when improvements in sequencing 
technology read length and accuracy will allow the interrogation 
of all microsatellite motif sizes, lengths, and sequences that are 
present within individual genomes. 

Impact of interruptions on genome function 

The APC tumor suppressor gene provides a provocative 
example in which a single, population-specific, germline SNP can 
affect disease risk by altering the mutagenic potential of a 
microsatellite sequence. Our data directly support the previous 
model that the perfect [A8/T8] allele creates a hypermutable region 
within the APC gene, leading to cancer predisposition [19]. We 
measured DNA polymerase strand slippage error rates that are up 
to 50-fold lower during replication of the iMS sequences [A3TA4/ 
T3AT4], compared to the perfect sequences [A8/T8] (Figure 3). 
Previous biochemical studies of trinucleotide microsatellites have 
shown that interruptions decrease slipped strand formation [56] and 
decrease the thermostability of secondary structures formed by 
repetitive sequences [57]. Our results advance these studies by 
demonstrating that the mechanism of reduced mutability by an 
interruption within a mononucleotide A/T allele is a reduction in 
polymerase strand slippage errors during DNA synthesis. 

Expanding on the APC gene observation that SNPs can create 
perfect microsatellites and hypermutable sequences in disease 
states, we identified ~3,000-4,000 genes (depending on the 
population group) with exonic microsatellites that are perfect in 
the reference genome but are present as iMSs (Figure 2). The exonic iMS 
alleles that are specific to only one or two populations likely 
represent a pool of genes that are at risk for increased mutation 
in the other population groups. Madsen and colleagues reported 
that short tandem repeats/microsatellites in exons are 
overrepresented among human genes associated with cancer and immune 
system diseases [58]. We observed that while European-specific 
iMSs are enriched in genes associated with immunological 
function, African-specific iMSs are enriched in genes associated 
with neurological function. Thus, population-specific differences in 
microsatellite architecture (perfect versus interrupted) may be a 
widespread mechanism by which genetic ancestry impacts 
individual disease risk. While our focus has been on comparing 
population groups, our FST analysis indicated that many iMS 
alleles are not fixed within population groups, thus potentially 
providing a rich source of individual genetic variability. 

Perfect microsatellites are at a higher risk for microsatellite 
expansion mutations that are causative for numerous 
neurological/neurodegenerative diseases [3], and the presence of 
interrupted alleles has been well documented to decrease disease risk. We 
investigated several genes previously described as harboring 
disease-associated, coding iMS alleles [38]. The genetic 
consequences of the iMS variants we identified include both sequences 
that are expected to increase mutability, and sequences that are 
expected to decrease mutability. Various AIB1 iMS alleles have 
been noted previously in a survey of European DNA samples [59], 
consistent with the allelic distribution we observed for the 1000 
Genomes European population group. One of the iMS variants 
we identified within AIB1 occurs at a much higher allele 
frequency in the African population, and is expected to display 
higher mutability than the reference sequence, due to an increased 
perfect tandem repeat tract length. The two HOXD13 iMS alleles 
we identified were observed previously in a pedigree analysis of 16 
synpolydactyly families [60]. Importantly, repeat expansions in 
these families segregated with the disease phenotype; however, the 
iMSs were retained in all of the expanded alleles. Recently, 
amyotrophic lateral sclerosis patients have been described as having 
moderately expanded SCA2 iMS alleles that retain at least one of 
the interruptions [20,61]. Both microsatellite length and purity 
(interruption) polymorphisms of SCA1 and SCA2 have been 
described among unaffected individuals [62,63], consistent with 
the iMS variant alleles we detected in this study. 

Pathways leading to microsatellite interruption 

The pathways by which iMSs arise in genomes have not been 
extensively studied. Several cellular mechanisms could account for 
the production of iMS alleles in genomes, including (but not 
limited to) endogenous DNA damage-induced mutations and 
DNA synthesis errors during DNA replication, repair and/or 
recombination. The types of iMS ultimately observed in human 
genomes will be further shaped by DNA repair pathways and 
selection, which will serve to reduce the number of and narrow the 
types of mutational events within microsatellites. We demonstrate 
here that base substitutions are the primary type of iMS present in 
individual human genomes. We also used our established 
biochemical assay to determine the potential contribution of 
errors created by three distinct DNA polymerase families to the 
formation of iMS alleles. For the microsatellite templates and types 
of detectable errors examined, we observed that 
genome-stabilizing microsatellite interruptions are created most frequently 
in vitro by error-prone, specialized Pols η and κ, while replicative 
Pols α and δ rarely created interruptions (Figure 5). The generality 
of our observations for all microsatellite sequences and human 
polymerases is not known, and must await future experimental 
analyses. Nevertheless, we observed that DNA Pol η is very 
efficient at making interruptions within perfect microsatellites and 
creates multiple errors within a single DNA synthetic event. Pol η 
also creates base substitution errors within the tandem repeat 
tracts of iMS templates, with the net result being a more random 
sequence. DNA Pol η serves several important functions in human 
genome stability. Germline mutations leading to loss of Pol η 
activity cause the cancer predisposition syndrome, xeroderma 
pigmentosum-variant [64], and enhanced cellular UV sensitivity 
[65]. Pol η has been well characterized biochemically, and is 
capable of accurate translesion synthesis across UV photoproducts 
and other DNA lesions [64,66]. Human Pol η also is required for 
the maintenance of common fragile sites and prevention of 
chromosomal rearrangements [67,68]. On the other hand, Pol η 
performs a key role in targeted mutagenesis during somatic 
hypermutation of immunoglobulin genes, primarily targeting 
mutations to A:T base pairs [69-71]. Here, we show in vitro that 
Pol η litters mononucleotide A/T microsatellites with many base 
substitution errors (Figure 3C and Table 5), an error characteristic 
that is highly reminiscent of somatic hypermutation. 

Interruptions - The result of an interplay of replication, 
repair, and recombination 

Previous studies of primate MSs reported that point mutations 
occur more frequently than expected within microsatellites, 
based on the overall genome divergence [72], and that there is a 
two-fold higher rate of base substitutions within coding 
microsatellites relative to other coding sequences [73]. In a 
study of microsatellite births and deaths, we observed that 
substitutions were the leading cause of death, and that the 
density of births/deaths is non-random throughout the genome 
[27]. Although interruptions can be removed from 
microsatellites, restoring long perfect repeat stretches and high mutability 
of microsatellites [27], our in vitro results suggest that this may 
be a rare event during DNA synthesis, based on the small number 
of microsatellites examined. 

Our discovery that interruptions are created more frequently by 
low fidelity repair and specialized polymerases than by high 
fidelity replicative polymerases suggests one potential mechanistic 
explanation for these observations. Based on our data to date, we 
would predict that the frequency of interruptions among 
microsatellites in the genome (of the same motif and number) 
will depend upon the relative activities of replication, repair and 
recombination DNA synthesis pathways, such that more iMSs are 
expected in genomic regions where either repair or specialized 
polymerases, such as Pols η, κ and β, are more frequently 
engaged. DNA synthesis by these polymerases would have the 
consequence of speeding up microsatellite death and impeding 
microsatellite resurrection [74]. For example, specialized 
polymerases may be engaged at the replication fork more often during 
synthesis of highly repetitive microsatellite sequences than of 
coding sequences, because replicative polymerases are inhibited 
[46,68,75]. Indeed, Pol κ was recently implicated in the synthesis 
of DNA at stalled replication forks in unstressed human cells [76]. 
Alternatively, an increased level of DNA damage within 
microsatellites, relative to coding sequences, would necessarily engage 
repair and specialized polymerases during the downstream 
pathways of gap-filling or translesion synthesis, respectively. A 
noncanonical pathway of mismatch repair that is activated by 
DNA lesions was shown to recruit Pol η to chromatin in a 
replication-independent manner [77]. Finally, Pol η activity may 
be targeted to specific genomic sequences, such as the highly 
mutable hotspots identified for somatic hypermutation of 
immunoglobulin genes. 

Perspective 

Microsatellites present within regulatory regions of the 
genome can affect gene expression, and allele length polymor- 
phisms are increasingly recognized as contributing to pheno- 
typic variation and disease risk [5,10,12]. Indeed, it has been 
previously proposed that polymorphic microsatellite alleles 
present within candidate genes associated with a disease or 
trait should be considered as contributing to the trait [11]. 
Genomic microsatellites display genetic variation that includes 
both allele length and sequence polymorphisms. The genetic 
architecture of microsatellites can include stabilizing, interrupt- 
ed alleles. Our study advances our understanding of the impact 
of microsatellite sequence variation by illuminating the sheer 
abundance of iMS alleles within individual human genomes and 
the magnitude of the genome stabilization effects. We have 
identified genes encoding exonic microsatellites that are present 
as protective, interrupted alleles in only one of four human 
population groups. These population-specific, iMS-containing 
genes are enriched in distinct functional pathways, suggesting 
that microsatellite sequence variation may contribute to the 
effects of genetic ancestry on disease risk. Importantly, our 
analyses demonstrate that many iMS alleles are not fixed within 
population groups, suggesting that microsatellite interruptions 
could be a source of genetic variability impacting individual 
phenotypic variation. 

Materials and Methods 

Identification of orthologous microsatellites in primate 
genomes 

We identified perfect as well as interrupted microsatellites in 
human (hg18), chimpanzee (panTro2), orangutan (ponAbe2), 
macaque (rheMac2) and marmoset (calJac1) genomes using 
Sputnik [78] and a computational pipeline that we developed 
for proper extraction of iMSs (see below). In this approach, 
Sputnik is utilized to perform a genome-wide search for 
microsatellite 'seeds' (see Table S1 for search parameters), i.e., 
stretches of perfect mono-, di-, tri- and tetra-nucleotide repeats at 
or above the threshold repeat lengths of 9, 5, 4 and 3 units, 
respectively (following [24,79]). Each seed's (e.g. [AC]6) flanking 
sequences are examined for the presence of (a) any additional 
seeds of any motif, or (b) additional instances of the repeat motif 
(e.g. [AC]2) with the intervening non-repeat nucleotides extending 
to not more than the length of the repeat motif itself (here, 2 bp). If 
additional complete repeats of the repeating motif or seeds 
composed of the same repeat motif are identified in the 
neighborhood of the seed, then the focal seed and the discovered 
extensions are merged into a single microsatellite. To complete the 
above example, if the focal seed [AC]6 exists such that (a) on its 3' 
end, following a dinucleotide GT, another seed [AC]7 is found, 
and (b) on its 5' end an immediately adjacent instance 
of [CA]2 is found, then the focal seed is extended to 
include these additional repeats such that the final repeat becomes 
[AC]7GT[AC]6[CA]2. This extension process is continued itera- 
tively into the flanking regions until no additional instances 
of the focal motif are identified, or until the terminal additions to the 
microsatellite are composed of repeat instances that are shorter 
than two repeats long. After the extension process is terminated, 
each repeat is classified as an iMS if the above microsatellite 
extension process was possible, and as a perfect microsatellite if the 
extension was not possible. Compound microsatellites, created 
when adjacent seeds were composed of different motifs, are 
discarded. 
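
The seed-and-extend idea described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the pipeline used in the study: a regular expression stands in for Sputnik, only one motif is examined at a time, rotated motifs (e.g. [CA] next to [AC]) and compound repeats are not handled, and the function names `find_runs` and `extend_seeds` are hypothetical.

```python
import re

# Minimum repeat units for a seed, keyed by motif length (mono- to tetranucleotide).
SEED_THRESHOLDS = {1: 9, 2: 5, 3: 4, 4: 3}


def find_runs(seq, motif, min_units):
    """Return (start, end) spans of perfect runs of `motif` with >= min_units copies."""
    pattern = re.compile("(?:%s){%d,}" % (re.escape(motif), min_units))
    return [m.span() for m in pattern.finditer(seq)]


def extend_seeds(seq, motif):
    """Merge perfect runs of one motif that are separated by short gaps.

    Runs of at least two units are chained together when the intervening
    non-repeat sequence is no longer than the motif itself. A chain is
    reported only if it contains at least one run of seed length.
    Returns (start, end, n_interruptions) tuples; n_interruptions == 0
    means a perfect microsatellite, otherwise an iMS.
    """
    k = len(motif)
    min_seed_bp = SEED_THRESHOLDS[k] * k
    chains, chain = [], []
    for run in find_runs(seq, motif, 2):
        if chain and run[0] - chain[-1][1] <= k:
            chain.append(run)          # gap is short enough: extend the locus
        else:
            if chain:
                chains.append(chain)
            chain = [run]              # start a new candidate locus
    if chain:
        chains.append(chain)

    loci = []
    for chain in chains:
        if any(end - start >= min_seed_bp for start, end in chain):
            loci.append((chain[0][0], chain[-1][1], len(chain) - 1))
    return loci
```

For a sequence such as `"TTT" + "AC" * 6 + "GT" + "AC" * 7 + "GGG"`, the two perfect runs are separated by a 2-bp gap and are therefore merged into one singly interrupted locus.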

We then identified orthologous microsatellites using the publicly 
available multiZ alignments of primate genomes [80]. From the 
identified set of orthologous microsatellites, we removed those that 
(1) were located within 25 bp of each other; (2) possessed at least 
one nucleotide of low sequence quality (namely, with PHRED 
score below 20); (3) had low-complexity flanking (20 bp upstream 
and 20 bp downstream) sequences; (4) had flanking sequence 
identity below 85% between any species pair; (5) differed in 
nucleotide sequence of the repeating motif; (6) had more than two 
interruptions in any species; (7) were interrupted microsatellites 
but differed in the sequence of the interrupting nucleotide(s) 
between species; (8) were interrupted microsatellite loci that 
differed in the context of the interruption (i.e., the repeat 
nucleotides immediately flanking the interruption) between species 
(Table S1). Our final set of microsatellite loci consisted of 30,715 
perfect orthologous microsatellite loci and 46,356 orthologous 
microsatellites with one or two interruptions in the studied species. 

The size of each iMS was measured in terms of repeat numbers 
and was calculated by dividing the total length of microsatellite- 
native sequence (i.e., all sequence other than the interrupting 
nucleotides) by the size of the repeating motif. Mutability values 
and their respective 95% confidence intervals (CI) were measured 
at multiple repeat numbers for microsatellites with 0, 1 and 2 
interruptions separately, using methods previously implemented in 
[50]. 
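
As a small worked example of the size definition above, the repeat number of an iMS is its microsatellite-native length divided by the motif size. The function name and arguments below are hypothetical and serve only to illustrate the arithmetic.

```python
def repeat_number(locus_length_bp, interruption_length_bp, motif_size):
    """Repeat number = (locus length - interrupting nucleotides) / motif size."""
    native_length = locus_length_bp - interruption_length_bp
    return native_length / motif_size


# [AC]6 GT [AC]7: a 28-bp locus with a 2-bp interruption and a 2-bp motif
print(repeat_number(28, 2, 2))  # 13.0
```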

Identification of interruptions using the 1000 genomes 
Phase-1 dataset 

We obtained variant calls (SNPs and indels) from the 1000 
Genomes Phase-1 Project [36] for four population groups — 
Africans, Europeans, Asians and Americans. These calls were 
intersected with perfect microsatellites (mono-, di-, tri-, and tetra- 
nucleotide repeats of length ranges 8-10, 10-18, 12-24, and 16- 
28 bp respectively) identified from the human reference genome 
(UCSC build hg19) - the lower bounds of the chosen length ranges 
represent microsatellite thresholds and the upper bounds represent 
the length up to which indel calls generated from short-reads are 
reliable (see [25] for details). All indel and SNP variants present at 
an allele frequency ≥0.05 were identified separately for each 
population group. These variants were considered to be interrup- 
tions if they were located within a microsatellite but not at the 
starting/ending repeat unit. Additionally, for indels, only those 
indels that did not include a whole-motif insertion/deletion were 
considered to be interruptions. We next compared the list of iMS 
loci across populations to identify microsatellites interrupted in all 
populations and in subsets of populations. Population-specific 
interruptions were defined as those that are interrupted in one 
population, but remain perfect in the other three. We obtained 
coordinates of disease-associated loci [38] from the UCSC 
Genome Browser [81,82], and intersected the 1000 Genomes 
Phase-1 Project variant calls to identify interruptions at these loci 
across the four population groups. Again, we used the allele 
frequency cut-off of 0.05 and the aforementioned filters to identify 
interruptions. 
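
The filters described in this section can be expressed as a single predicate. The sketch below is an illustration only: it assumes 0-based, half-open microsatellite coordinates, a VCF-style `ref`/`alt` pair, and it approximates a "whole-motif" indel as any length change that is a multiple of the motif size. The function name and signature are hypothetical.

```python
def is_interruption(ms_start, ms_end, motif_size, var_pos, ref, alt,
                    allele_freq, min_af=0.05):
    """Decide whether a variant counts as a microsatellite interruption.

    The microsatellite occupies [ms_start, ms_end); var_pos is 0-based.
    """
    # Keep only variants at or above the allele frequency cut-off.
    if allele_freq < min_af:
        return False
    # The variant must lie inside the repeat, excluding the first and last unit.
    if not (ms_start + motif_size <= var_pos < ms_end - motif_size):
        return False
    # Assumption: an indel whose length is a multiple of the motif size is
    # treated as a whole-motif (length) change rather than an interruption.
    indel_len = abs(len(ref) - len(alt))
    if indel_len and indel_len % motif_size == 0:
        return False
    return True
```

A locus would then be called population-specific if this predicate holds for some variant in exactly one population group and for none in the other three.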

FST estimation 

For interruptions present in all four population groups, the 
frequencies of the interruption variant alleles ($p$) were extracted for 
each of the four population groups. For each interruption, 
heterozygosity ($H = 2pq$) values were computed separately for 
each population group, where $q = 1 - p$ denotes the frequency of the 
reference allele. The average of these population heterozygosities 
was computed as $H_S$. Next, the average allele frequencies for the 
total population ($P$, $Q$) were computed by averaging the allele 
frequencies ($p$ and $q$) over the four populations. Next, total 
heterozygosity was estimated as $H_T = 2PQ$. $F_{ST}$ was then 
estimated as $F_{ST} = (H_T - H_S)/H_T$ [83]. 
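
The calculation above can be written out directly. This is a minimal sketch of the stated formulas; the function name is hypothetical, and returning 0 when the total heterozygosity is zero is an assumption made here for convenience, since the ratio is undefined in that case.

```python
def fst(variant_freqs):
    """Fixation index from per-population variant allele frequencies.

    variant_freqs: one frequency p per population group (four in the text).
    """
    n = len(variant_freqs)
    # H_S: mean of the within-population heterozygosities 2pq.
    h_s = sum(2 * p * (1 - p) for p in variant_freqs) / n
    # P and Q: allele frequencies averaged over the populations.
    p_bar = sum(variant_freqs) / n
    q_bar = 1 - p_bar
    # H_T: heterozygosity of the pooled (total) population.
    h_t = 2 * p_bar * q_bar
    if h_t == 0:
        return 0.0  # assumption: undefined ratio reported as 0
    return (h_t - h_s) / h_t
```

Identical frequencies in all populations give a value of 0, whereas strongly diverged frequencies (for example 0.1, 0.1, 0.9, 0.9) give a value of about 0.64.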

Heterozygosity estimation and significance testing 

Population allele frequencies for the variant iMSs as well as 
perfect microsatellites (those without interrupting variants) were 
obtained from the VCF files, and heterozygosity was estimated as 
$2pq$, where $p$ = allele frequency of the variant and $q = 1 - p$. 
Frequencies of iMSs and perfect microsatellites were estimated 
at different heterozygosity bins (ranging from 0 to 0.5, with bin 
size equal to 0.02), and the distributions of these frequencies were 
compared against each other using a two-sample bootstrap Kolmo- 
gorov-Smirnov test with 10,000 iterations from the R "Matching" 
package [84]. 
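
The binning step can be illustrated as follows. This sketch covers only the construction of the heterozygosity distribution (25 bins of width 0.02 between 0 and 0.5); the bootstrap Kolmogorov-Smirnov comparison performed in R is not reproduced, and the function name is hypothetical.

```python
def heterozygosity_histogram(variant_freqs, bin_size=0.02, h_max=0.5):
    """Proportion of loci falling into each heterozygosity bin.

    Heterozygosity is 2pq with q = 1 - p, so it ranges from 0 to 0.5.
    """
    n_bins = round(h_max / bin_size)
    counts = [0] * n_bins
    for p in variant_freqs:
        h = 2 * p * (1 - p)
        # The maximum value (h == h_max) is placed in the last bin.
        idx = min(int(h / bin_size), n_bins - 1)
        counts[idx] += 1
    total = len(variant_freqs)
    if total == 0:
        return [0.0] * n_bins
    return [c / total for c in counts]
```

Computing one such histogram for iMS loci and one for perfect microsatellites yields the two distributions that are then compared.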

LD estimation and phenotype association 

Pairwise correlation coefficient, $r^2$ (proxy for LD), was 
calculated between interruption-causing indels and neighboring 
(located within a 1-Mb window around the indel), population- 
matched SNPs from the 1000 Genomes Phase-1 dataset using 
PLINK v1.07 (http://pngu.mgh.harvard.edu/purcell/plink/) 
[85]. For each indel, SNPs with the maximum $r^2$ values were 
chosen for subsequent analysis. Indel-SNP pairs that showed a 
perfect LD ($r^2 = 1$) were selected and intersected with a list of 
exon coordinates to identify exonic indel-SNP pairs in perfect LD 
using Galaxy. The SNPs from such perfect LD pairs were 
submitted to SNPnexus to obtain phenotype and disease 
associations. 
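
The selection of indel-SNP pairs can be sketched as a simple post-processing step on pairwise LD output. The input format (tuples of indel identifier, SNP identifier and correlation value) and the function names are assumptions for illustration; ties are resolved by keeping the first SNP encountered, which is a simplification.

```python
def best_snp_per_indel(pairs):
    """For each indel, keep the neighboring SNP with the highest r^2.

    pairs: iterable of (indel_id, snp_id, r2) tuples.
    Returns {indel_id: (snp_id, r2)}.
    """
    best = {}
    for indel_id, snp_id, r2 in pairs:
        if indel_id not in best or r2 > best[indel_id][1]:
            best[indel_id] = (snp_id, r2)
    return best


def perfect_ld_pairs(pairs, tol=1e-9):
    """Return {indel_id: snp_id} for indels whose best SNP is in perfect LD."""
    return {
        indel_id: snp_id
        for indel_id, (snp_id, r2) in best_snp_per_indel(pairs).items()
        if r2 >= 1 - tol
    }
```

The SNPs returned by `perfect_ld_pairs` correspond to the set that would be intersected with exon coordinates and submitted for phenotype annotation.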

Gene Ontology analyses 

iMS loci were intersected with exon coordinates obtained from 
the UCSC Genome Browser [81,82] using Galaxy [86,87,88] and 
HUGO gene names [89] were obtained for exonic iMS. Using 
functions from the R package "GOstats" [90], we compared the 
exonic iMS-containing genes with all other genes in the genome to 
determine an over/underrepresentation of GO molecular func- 
tions, biochemical processes and cellular components in the 
selected gene set. 

In vitro polymerase assay 

Purified calf thymus pol α-primase complex (pol α) was kindly 
supplied by Dr. Fred Perrino or the human complex was 
purchased from Chimerx (Madison, WI). Recombinant DNA 
pol β was purified as described [91]. The 4-subunit recombinant 
human Pol δ4 was purified as described [92] and was a generous 
gift of Dr. Marietta Lee. Purified full-length human pol κ and pol 
η were purchased from Enzymax (Lexington, KY). [GT]n and 
[TC]n microsatellite-containing herpes simplex virus type 1 
thymidine kinase (HSV-tk) vectors have been previously described 
[26,45]. Dinucleotide microsatellites were inserted in-frame 
between positions 111 and 112 of the HSV-tk sense strand. 
Additional vectors were constructed with in-frame inserts in the 
same position as above and the final sequences of [T]8, [A]8, [T]4A[T]3 
and [A]3T[A]4. These sequences model the perfect and 
interrupted (iMS) alleles found within the APC gene (positions 
3917-3924) of the Ashkenazi Jewish and non-Ashkenazi popula- 
tions, respectively [19]. 

Linear DNA fragments and ssDNA were used to construct MluI 
(position 83) to StuI (position 180) gapped duplex (GD) molecules, 
as described [47,93]. In vitro polymerase reactions for pol α [94], 
pol β [45], and pols δ, κ, and η [46] at dinucleotide microsatellite 
templates were previously described. For the APC gene model 
templates, polymerase reactions contained 1 pmol of oligonucle- 
otide-primed ssDNA at 20 nM concentration. Reaction conditions 
were the same as in the references above except 20 units of 
Chimerx human pol α, 15 pmol of pol β, and 1-2 pmol of pol η 
were used. To sample reaction products for mutations, small 
fragments were prepared by MluI and StuI digestion and 
hybridized to the corresponding GD molecule as described [45]. 
Successful hybridization was verified by agarose gel analysis as 
described [52]. An aliquot of DNA from the final hybridization 
was used to transform E. coli strain FT334 for mutant frequency 
determination on VBA selective media [47]. The presence of 
50 μg/mL chloramphenicol (Cm) selects for progeny of the 
polymerase-synthesized strand and the presence of 40 μM 5- 
fluoro-2'-deoxyuridine (FUdR) selects for bacteria carrying HSV- 
tk mutant plasmids. The observed HSV-tk mutant frequency (MF) 
is the number of FUdR^R Cm^R colonies divided by the number of 
Cm^R colonies. To control for pre-existing mutations, we also 
determined the HSV-tk MF for each ssDNA used to construct the 
GD molecules. Independent mutants for DNA sequence analyses 
were isolated as described [47] from two polymerase reactions per 
template. The DNA sequence of the HSV-tk gene in the MluI- 
StuI region of each mutant was determined by dideoxy DNA 
sequence analysis of plasmid DNA as described [45]. 



In vitro polymerase mutational specificity calculations 

Pol η and Pol κ produce multiple mutational events per target 
sequence. In order to properly compare polymerase error 
frequencies (Pol EFs) among polymerases, we identified those 
mutational events that were detectable as single mutational events, 
and adjusted the observed HSV-tk MF to reflect multiple errors 
per target. First, Pol EFs were determined by the following 
equation: Pol EF = (Observed MF) - (ssDNA background MF) - 
(Outside target MF), where outside target MF is the frequency of 
errors occurring outside the gap target. Next, each mutational 
event was scored as detectable or undetectable. All frameshifts 
and those base substitutions that caused an amino acid change 
or a stop codon within coding sequences were considered 
detectable. Base substitutions within microsatellite sequences 
were only considered detectable when a stop codon was 
produced. Only detectable events were used for determining 
Pol EFest. Each mutational event was also scored as tandem or 
nontandem. Tandem events were those adjacent to one another, 
whereas nontandem were errors > 1 nt apart. Pol EFs were then 
corrected for the existence of multiple nontandem mutations as 
described [46]. The Pol EFest obtained is the overall Pol EFest 
and includes mutational events within the microsatellite 
sequence and within the adjacent HSV-tk coding sequence 
(see Table S5 and accompanying footnotes). The Pol EFest of a 
specific type of mutational event was calculated from the 
proportion of the specific mutational event (among the total 
analyzed) multiplied by Pol EFest. For analyses presented herein, 
we further subdivided the microsatellite Pol EFest into unit- 
based indel Pol EFest or interruption Pol EFest. A unit-based 
indel is an error that occurs when an entire microsatellite unit 
or units are inserted or deleted (e.g., [GT]10→[GT]9). An 
interruption is an indel or base substitution that disrupts 
the repetitive nature of the microsatellite sequence (e.g., 
[GT]10→[GT]5T[GT]5). 
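
The arithmetic behind these quantities can be sketched as below. The sketch covers the mutant frequency, the background-corrected polymerase error frequency, and the error frequency of a specific event type; the correction for multiple nontandem mutations described in [46] is not reproduced. The function names are hypothetical.

```python
def mutant_frequency(n_fudr_cm_resistant, n_cm_resistant):
    """Observed HSV-tk MF: FUdR- and Cm-resistant colonies / Cm-resistant colonies."""
    return n_fudr_cm_resistant / n_cm_resistant


def pol_ef(observed_mf, ssdna_background_mf, outside_target_mf):
    """Pol EF = observed MF - ssDNA background MF - outside-target MF."""
    return observed_mf - ssdna_background_mf - outside_target_mf


def event_ef(pol_ef_est, n_event, n_total):
    """EF of one event type: its proportion among analyzed events times Pol EFest."""
    return (n_event / n_total) * pol_ef_est
```

For instance, if interruptions made up 10 of 50 sequenced detectable events, the interruption error frequency would be one fifth of the overall Pol EFest.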

Supporting Information 

Dataset S1 All interrupted microsatellites in the four 1000 
(PDF) 

Dataset S2 African (AFR) population-specific, exonic interrupt- 
ed microsatellites. 
(PDF) 

Dataset S3 American (AMR) population-specific, exonic inter- 
rupted microsatellites. 
microsatellites. 

Dataset S5 European (EUR) population-specific, exonic inter- 
rupted microsatellites. 
(PDF) 

Dataset S6 Fixation index values of interrupted microsatellites 
in four 1000 genomes populations. 

Figure S1 Effect of interruption identity on microsatellite 
mutability. Mutability of singly-interrupted poly-A microsatellites 
binned according to their interruption ([A]nT[A]n, [A]nC[A]n 
and [A]nG[A]n). Number of repeats of a microsatellite was 
calculated by dividing the total length of the microsatellite, 
excepting the interrupting nucleotides, by the size of its repeating 
motif. At each repeat number the lines designate the 2.5th and 
97.5th percentiles of empirical distributions that were obtained 
through resampling. 
(PDF) 

Figure S2 The effect of microsatellite interruptions on protein- 
coding sequences. (A). Interruptions present in more than one 
population group. (B). Interruptions present in individual popu- 
lation groups. 
(PDF) 

Figure S3 Proportion of iMS and perfect MS alleles at different 
levels of heterozygosity. (A). African population; (B). Asian 
population; (C). European population; (D). American population. 
The density of iMSs with heterozygosity below 10% is likely an 
underestimate since our data did not include variants with 
frequency below 5%. 
(PDF) 

Figure S4 Proportion of iMS alleles at different levels of linkage 
disequilibrium with neighboring, population-matched SNPs. 
(PDF) 

Figure S5 Microsatellite loci are saturated by gain/loss events. 
Numbers in blue and red indicate the number of interruptions 
gained and lost in the respective branch of the five-species primate 
tree. 
(PDF) 

Figure S6 Effect of motif size on population-specific interrup- 
tions in 1000 genomes datasets. (A). American population; (B). 
Asian population; (C). European population. 
(PDF) 

Figure S7 Effect of repeat number on population-specific 
interruptions in 1000 genomes datasets. (A). American population; 
(B). Asian population; (C). European population. Individual panels 
are data for mono-, di-, tri-, and tetranucleotide microsatellites 
within each population. 
(PDF) 

Figure S8 Effect of motif composition on population-specific 
interruptions in 1000 genomes datasets. (A). American population; 
(B). Asian population; (C). European population. Individual panels 
are data for mono- and dinucleotide microsatellites within each 
population. 
(PDF) 

Figure S9 Unique mutational signatures of polymerase inter- 
ruption errors within [GT] dinucleotide microsatellites. (A). GT10 
template; (B). GT13 template; (C). GT19 template. DNA synthesis 
proceeds from right to left. The middle line of sequence is the in- 
frame wild-type HSV-tk gene and subscripts indicate each 
dinucleotide unit of the microsatellite. Indels are shown above 
the sequence and base substitutions are shown below. Each symbol 
represents one mutational event: (A) one base deletion; (♦) two 